In [1]:

    
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

Chicago data

The data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4 (see the module docstring of open_cp.sources.chicago See also https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

In this notebook, we quickly look at the data, check that the data agrees between both sources, and demo some of the library features provided for loading the data.



In [2]:

    
import open_cp.sources.chicago as chicago
import geopandas as gpd

import sys, os, csv, lzma
filename = os.path.join("..", "..", "open_cp", "sources", "chicago.csv")
filename_all = os.path.join("..", "..", "open_cp", "sources", "chicago_all.csv.xz")
filename_all1 = os.path.join("..", "..", "open_cp", "sources", "chicago_all1.csv.xz")

Let us look at the snapshot of the last year, vs the total dataset. The data appears to be the same, though the exact format changes.



In [3]:

    
with open(filename, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))









    



['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']
['HZ560767', '12/22/2016 02:55:00 AM', '010XX N CENTRAL PARK AVE', '4387', 'OTHER OFFENSE', 'VIOLATE ORDER OF PROTECTION', 'APARTMENT', 'N', 'Y', '1112', '27', '26', '1152189', '1906649', '41.899712716', '-87.716454159', '(41.899712716, -87.716454159)']



In [4]:

    
with lzma.open(filename_all, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))









    



['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']
['8651563', 'HV322174', '06/05/2012 11:00:00 AM', '022XX N CANNON DR', '0810', 'THEFT', 'OVER $500', 'STREET', 'false', 'false', '1814', '018', '43', '7', '06', '1175057', '1915111', '2012', '02/04/2016 06:33:39 AM', '41.922450893', '-87.632206293', '(41.922450893, -87.632206293)']

As well as loading data directly into a TimedPoints class, we can process a sub-set of the data to GeoJSON, or straight to a geopandas dataframe (if geopandas is installed).



In [5]:

    
geo_data = chicago.load_to_GeoJSON()
geo_data[0]









    Out[5]:





{'geometry': {'coordinates': [-87.716454159, 41.899712716], 'type': 'Point'},
 'properties': {'address': '010XX N CENTRAL PARK AVE',
  'case': 'HZ560767',
  'crime': 'OTHER OFFENSE',
  'location': 'APARTMENT',
  'timestamp': '2016-12-22T02:55:00',
  'type': 'VIOLATE ORDER OF PROTECTION'},
 'type': 'Feature'}



In [6]:

    
frame = chicago.load_to_geoDataFrame()
frame.head()









    Out[6]:






  
    
      
      address
      case
      crime
      geometry
      location
      timestamp
      type
    
  
  
    
      0
      010XX N CENTRAL PARK AVE
      HZ560767
      OTHER OFFENSE
      POINT (-87.71645415899999 41.899712716)
      APARTMENT
      2016-12-22T02:55:00
      VIOLATE ORDER OF PROTECTION
    
    
      1
      051XX S WASHTENAW AVE
      HZ561134
      BATTERY
      POINT (-87.691539994 41.800445234)
      RESIDENTIAL YARD (FRONT/BACK)
      2016-12-22T11:17:00
      AGGRAVATED: OTHER FIREARM
    
    
      2
      059XX W DIVERSEY AVE
      HZ565584
      DECEPTIVE PRACTICE
      POINT (-87.774165121 41.931166274)
      RESIDENCE
      2016-12-09T12:00:00
      FINANCIAL IDENTITY THEFT $300 AND UNDER
    
    
      3
      001XX N STATE ST
      HZ561772
      THEFT
      POINT (-87.62787669799999 41.883500187)
      DEPARTMENT STORE
      2016-12-22T18:50:00
      RETAIL THEFT
    
    
      4
      008XX N MICHIGAN AVE
      HZ561969
      THEFT
      POINT (-87.624095634 41.897982937)
      SMALL RETAIL STORE
      2016-12-22T19:20:00
      RETAIL THEFT

Explore with QGIS

We can save the dataframe to a shape-file which can be viewed in e.g. QGIS.

To explore the spatial-distribution, I would recommend using an interactive GIS package. Using QGIS (free and open source) you can easily add a basemap using GoogleMaps or OpenStreetMap, etc. See http://maps.cga.harvard.edu/qgis/wkshop/basemap.php

I found this to be slightly buggy. On Windows, QGIS 2.18.7 I found that the following worked:

First open the chicago.shp file produced from the line above.
Select the Coordinate reference system "WGS 84 / EPSG:4326"
Now go to the menu "Web" -> "OpenLayers plugin" -> Whatever
The projection should change to EPSG:3857. The basemap will obscure the point map, so in the "Layers Panel" drag the basemap to the bottom.
Selecting EPSG:3857 at import time doesn't seem to work (which is different to the instructions..!)



In [7]:

    
# On my Windows install, if I don't do this, I get a GDAL error in
# the Jupyter console, and the resulting ".prj" file is empty.
# This isn't critical, but it confuses QGIS, and you end up having to
# choose a projection when loading the shape-file.
import os
os.environ["GDAL_DATA"] = "C:\\Users\\Matthew\\Anaconda3\\Library\\share\\gdal\\"

frame.to_file("chicago")

A geoPandas example

Let's use the "generator of GeoJSON" option shown above to pick out only BURGLARY crimes from the 2001-- dataset (which is too large to easily load into a dataframe in one go).



In [8]:

    
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "THEFT" ]
    
frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()









    Out[8]:






  
    
      
      address
      case
      crime
      geometry
      location
      timestamp
      type
    
  
  
    
      0
      022XX N CANNON DR
      HV322174
      THEFT
      POINT (-87.632206293 41.922450893)
      STREET
      2012-06-05T11:00:00
      OVER $500
    
    
      1
      077XX S EUCLID AVE
      HV326989
      THEFT
      POINT (-87.57727330199999 41.754642143)
      RESIDENCE
      2012-05-01T01:00:00
      AGG: FINANCIAL ID THEFT
    
    
      2
      068XX S WASHTENAW AVE
      HV326720
      THEFT
      POINT (-87.690703078 41.769306564)
      APARTMENT
      2012-06-09T02:00:00
      FROM BUILDING
    
    
      3
      019XX N CICERO AVE
      HV326202
      THEFT
      POINT (-87.746056778 41.915729613)
      DEPARTMENT STORE
      2012-06-08T16:30:00
      $500 AND UNDER
    
    
      4
      034XX N HALSTED ST
      HV327230
      THEFT
      POINT (-87.64941621200001 41.944953005)
      BAR OR TAVERN
      2012-06-10T02:30:00
      POCKET-PICKING



In [9]:

    
frame.to_file("chicago_all_theft")



In [10]:

    
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["crime"] == "BURGLARY" ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()









    Out[10]:






  
    
      
      address
      case
      crime
      geometry
      location
      timestamp
      type
    
  
  
    
      0
      003XX W 94TH PL
      HV327134
      BURGLARY
      POINT (-87.632356731 41.723006487)
      RESIDENCE
      2012-06-09T23:30:00
      FORCIBLE ENTRY
    
    
      1
      011XX N HERMITAGE AVE
      HV327217
      BURGLARY
      POINT (-87.671145473 41.902514714)
      RESIDENCE
      2012-06-06T12:00:00
      FORCIBLE ENTRY
    
    
      2
      047XX W WEST END AVE
      HV327110
      BURGLARY
      POINT (-87.743935241 41.883144607)
      RESIDENCE
      2012-06-09T01:30:00
      FORCIBLE ENTRY
    
    
      3
      038XX S SACRAMENTO AVE
      HV327153
      BURGLARY
      POINT (-87.69951882399999 41.823319195)
      APARTMENT
      2012-06-09T12:00:00
      UNLAWFUL ENTRY
    
    
      4
      013XX W 91ST ST
      HV327223
      BURGLARY
      POINT (-87.65724045 41.72858934)
      RESIDENCE
      2012-06-09T02:00:00
      FORCIBLE ENTRY



In [11]:

    
frame.to_file("chicago_all_burglary")



In [12]:

    
frame["type"].unique()









    Out[12]:





array(['FORCIBLE ENTRY', 'UNLAWFUL ENTRY', 'ATTEMPT FORCIBLE ENTRY',
       'HOME INVASION'], dtype=object)



In [13]:

    
frame["location"].unique()









    Out[13]:





array(['RESIDENCE', 'APARTMENT', 'DEPARTMENT STORE', 'RESIDENCE-GARAGE',
       'SCHOOL, PUBLIC, BUILDING', 'SMALL RETAIL STORE',
       'CONSTRUCTION SITE', 'CHA PARKING LOT/GROUNDS',
       'CHURCH/SYNAGOGUE/PLACE OF WORSHIP', 'APPLIANCE STORE',
       'RESTAURANT', 'CHA APARTMENT', 'CLEANING STORE',
       'GROCERY FOOD STORE', 'OTHER', 'DAY CARE CENTER',
       'RESIDENTIAL YARD (FRONT/BACK)', 'COMMERCIAL / BUSINESS OFFICE',
       'ABANDONED BUILDING', 'BARBERSHOP',
       'PARKING LOT/GARAGE(NON.RESID.)', 'WAREHOUSE', 'VACANT LOT/LAND',
       'BAR OR TAVERN', 'PARK PROPERTY', 'CTA GARAGE / OTHER PROPERTY',
       'RESIDENCE PORCH/HALLWAY', 'BANK', 'FEDERAL BUILDING', 'STREET',
       'OTHER RAILROAD PROP / TRAIN DEPOT', 'GAS STATION',
       'MEDICAL/DENTAL OFFICE', 'CAR WASH', 'MOVIE HOUSE/THEATER',
       'CONVENIENCE STORE', 'HOTEL/MOTEL', 'DRIVEWAY - RESIDENTIAL',
       'SCHOOL, PRIVATE, BUILDING', 'ALLEY', 'SCHOOL, PUBLIC, GROUNDS',
       'CURRENCY EXCHANGE', 'HOSPITAL BUILDING/GROUNDS',
       'COLLEGE/UNIVERSITY GROUNDS', 'FACTORY/MANUFACTURING BUILDING',
       'TAVERN/LIQUOR STORE', 'GOVERNMENT BUILDING/PROPERTY',
       'VEHICLE NON-COMMERCIAL', 'LIBRARY', 'VEHICLE-COMMERCIAL',
       'POOL ROOM', 'SCHOOL, PRIVATE, GROUNDS',
       'LAKEFRONT/WATERFRONT/RIVERBANK', 'DRUG STORE', 'CEMETARY',
       'NURSING HOME/RETIREMENT HOME', 'SIDEWALK',
       'COLLEGE/UNIVERSITY RESIDENCE HALL', 'ATHLETIC CLUB',
       'BOAT/WATERCRAFT', 'ANIMAL HOSPITAL', 'COIN OPERATED MACHINE',
       'FIRE STATION', 'CHA HALLWAY/STAIRWELL/ELEVATOR', 'FOREST PRESERVE',
       'OTHER COMMERCIAL TRANSPORTATION', 'PAWN SHOP',
       'SPORTS ARENA/STADIUM', 'BOWLING ALLEY',
       'POLICE FACILITY/VEH PARKING LOT', 'SAVINGS AND LOAN',
       'CTA PLATFORM', 'CREDIT UNION', 'ATM (AUTOMATIC TELLER MACHINE)',
       'NEWSSTAND', 'CTA TRACKS - RIGHT OF WAY', 'CTA STATION', '',
       'HIGHWAY/EXPRESSWAY', 'AIRPORT/AIRCRAFT', 'BRIDGE',
       'DELIVERY TRUCK', 'AIRPORT TERMINAL UPPER LEVEL - SECURE AREA',
       'CTA BUS', 'TAXICAB', 'VEHICLE - OTHER RIDE SERVICE',
       'AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA', 'CTA BUS STOP',
       'CTA TRAIN', 'AIRPORT EXTERIOR - NON-SECURE AREA'], dtype=object)

Upon loading into QGIS to visualise, we find that the 2001 data seems to be geocoded in a different way... The events are not on the road, and the distribution looks less artificial. Let's extract the 2001 burglary data, and then the all the 2001 data, and save.



In [14]:

    
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
                if event["properties"]["timestamp"].startswith("2001") ]

frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()









    Out[14]:






  
    
      
      address
      case
      crime
      geometry
      location
      timestamp
      type
    
  
  
    
      0
      079XX S CAMPBELL AVE
      G397434
      HOMICIDE
      POINT (-87.685268108 41.749135591)
      STREET
      2001-08-09T19:30:00
      FIRST DEGREE MURDER
    
    
      1
      048XX W KAMERLING AVE
      G473112
      HOMICIDE
      POINT (-87.746889161 41.905072512)
      STREET
      2001-08-10T01:04:00
      FIRST DEGREE MURDER
    
    
      2
      009XX N LAMON AVE
      G477463
      HOMICIDE
      POINT (-87.74830887500001 41.897236716)
      STREET
      2001-08-11T23:06:00
      FIRST DEGREE MURDER
    
    
      3
      036XX W OHIO ST
      G476774
      HOMICIDE
      POINT (-87.717959287 41.891770726)
      APARTMENT
      2001-08-12T18:17:00
      FIRST DEGREE MURDER
    
    
      4
      019XX S RACINE AVE
      G477822
      HOMICIDE
      POINT (-87.656374568 41.85626764)
      AUTO
      2001-08-12T04:35:00
      FIRST DEGREE MURDER



In [15]:

    
frame.to_file("chicago_2001")

Explore rounding errors

We check the following:

The X and Y COORDINATES fields (which we'll see, in a different notebook, at longitude / latitude coordinates projected in EPSG:3435 in feet) are always whole numbers.
The longitude and latitude data contains at most 9 decimals places of accuracy.

In the other notebook, we look at map projections. The data is most consistent with the longitude / latitude coordinates being the primary source, and the X/Y projected coordinates being computed and rounded to the nearest integer.



In [16]:

    
longs, lats = [], []
xcs, ycs = [], []

with open(filename, "rt") as file:
    reader = csv.reader(file)
    header = next(reader)
    print(header)
    for row in reader:
        if len(row[14]) > 0:
            longs.append(row[14])
            lats.append(row[15])
            xcs.append(row[12])
            ycs.append(row[13])









    



['CASE#', 'DATE  OF OCCURRENCE', 'BLOCK', ' IUCR', ' PRIMARY DESCRIPTION', ' SECONDARY DESCRIPTION', ' LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC', 'BEAT', 'WARD', 'FBI CD', 'X COORDINATE', 'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION']



In [17]:

    
set(len(x) for x in longs), set(len(x) for x in lats)









    Out[17]:





({8, 9, 10, 11, 12}, {8, 9, 10, 11, 12, 13})



In [18]:

    
any(x.find('.') >= 0 for x in xcs), any(y.find('.') >= 0 for y in ycs)









    Out[18]:





(False, False)

Repeated data

Mostly the "case" assignment is unique, but there are a few exceptions to this.



In [ ]:

    
import collections

with lzma.open(filename_all, "rt") as file:
    c = collections.Counter( event["properties"]["case"] for event in
                            chicago.generate_GeoJSON_Features(file, type="all") )
    
multiples = set( key for key in c if c[key] > 1 )
len(multiples)



In [ ]:

    
with lzma.open(file_all, "rt") as file:
    data = gpd.GeoDataFrame.from_features(
        event for event in chicago.generate_GeoJSON_Features(file, type="all")
        if event["properties"]["case"] in multiples
        )
    
len(data), len(data.case.uniques())

	address	case	crime	geometry	location	timestamp	type
0	010XX N CENTRAL PARK AVE	HZ560767	OTHER OFFENSE	POINT (-87.71645415899999 41.899712716)	APARTMENT	2016-12-22T02:55:00	VIOLATE ORDER OF PROTECTION
1	051XX S WASHTENAW AVE	HZ561134	BATTERY	POINT (-87.691539994 41.800445234)	RESIDENTIAL YARD (FRONT/BACK)	2016-12-22T11:17:00	AGGRAVATED: OTHER FIREARM
2	059XX W DIVERSEY AVE	HZ565584	DECEPTIVE PRACTICE	POINT (-87.774165121 41.931166274)	RESIDENCE	2016-12-09T12:00:00	FINANCIAL IDENTITY THEFT $300 AND UNDER
3	001XX N STATE ST	HZ561772	THEFT	POINT (-87.62787669799999 41.883500187)	DEPARTMENT STORE	2016-12-22T18:50:00	RETAIL THEFT
4	008XX N MICHIGAN AVE	HZ561969	THEFT	POINT (-87.624095634 41.897982937)	SMALL RETAIL STORE	2016-12-22T19:20:00	RETAIL THEFT

	address	case	crime	geometry	location	timestamp	type
0	022XX N CANNON DR	HV322174	THEFT	POINT (-87.632206293 41.922450893)	STREET	2012-06-05T11:00:00	OVER $500
1	077XX S EUCLID AVE	HV326989	THEFT	POINT (-87.57727330199999 41.754642143)	RESIDENCE	2012-05-01T01:00:00	AGG: FINANCIAL ID THEFT
2	068XX S WASHTENAW AVE	HV326720	THEFT	POINT (-87.690703078 41.769306564)	APARTMENT	2012-06-09T02:00:00	FROM BUILDING
3	019XX N CICERO AVE	HV326202	THEFT	POINT (-87.746056778 41.915729613)	DEPARTMENT STORE	2012-06-08T16:30:00	$500 AND UNDER
4	034XX N HALSTED ST	HV327230	THEFT	POINT (-87.64941621200001 41.944953005)	BAR OR TAVERN	2012-06-10T02:30:00	POCKET-PICKING

	address	case	crime	geometry	location	timestamp	type
0	003XX W 94TH PL	HV327134	BURGLARY	POINT (-87.632356731 41.723006487)	RESIDENCE	2012-06-09T23:30:00	FORCIBLE ENTRY
1	011XX N HERMITAGE AVE	HV327217	BURGLARY	POINT (-87.671145473 41.902514714)	RESIDENCE	2012-06-06T12:00:00	FORCIBLE ENTRY
2	047XX W WEST END AVE	HV327110	BURGLARY	POINT (-87.743935241 41.883144607)	RESIDENCE	2012-06-09T01:30:00	FORCIBLE ENTRY
3	038XX S SACRAMENTO AVE	HV327153	BURGLARY	POINT (-87.69951882399999 41.823319195)	APARTMENT	2012-06-09T12:00:00	UNLAWFUL ENTRY
4	013XX W 91ST ST	HV327223	BURGLARY	POINT (-87.65724045 41.72858934)	RESIDENCE	2012-06-09T02:00:00	FORCIBLE ENTRY

	address	case	crime	geometry	location	timestamp	type
0	079XX S CAMPBELL AVE	G397434	HOMICIDE	POINT (-87.685268108 41.749135591)	STREET	2001-08-09T19:30:00	FIRST DEGREE MURDER
1	048XX W KAMERLING AVE	G473112	HOMICIDE	POINT (-87.746889161 41.905072512)	STREET	2001-08-10T01:04:00	FIRST DEGREE MURDER
2	009XX N LAMON AVE	G477463	HOMICIDE	POINT (-87.74830887500001 41.897236716)	STREET	2001-08-11T23:06:00	FIRST DEGREE MURDER
3	036XX W OHIO ST	G476774	HOMICIDE	POINT (-87.717959287 41.891770726)	APARTMENT	2001-08-12T18:17:00	FIRST DEGREE MURDER
4	019XX S RACINE AVE	G477822	HOMICIDE	POINT (-87.656374568 41.85626764)	AUTO	2001-08-12T04:35:00	FIRST DEGREE MURDER